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ABSTRACT 

In  this  paper  we  describe  and  demonstrate  a  compiler  and  runtime  sup¬ 
port  mechanism.  The  methods  presented  here  are  capable  of  solving  a  wide 
range  of  sparse  and  unstructured  problems  in  scientific  computing.  The 
compiler  takes  as  input  a  Fortran  77  program  enhanced  with  specifications 
for  distributing  data,  and  the  compiler  outputs  a  message  passing  program 
that  runs  on  a  distributed  memory  computer.  The  runtime  support  for  this 
compiler  is  a  library  of  primitives  designed  to  efficiently  support  irregular 
patterns  of  distributed  array  accesses  and  irregular  distributed  array  parti¬ 
tions.  We  present  a  variety  of  Intel  iPSC/860  performance  results  obtained 
through  the  use  of  this  compiler. 
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1  Introduction 


During  the  past  few  years,  a  number  of  researchers  have  proposed  integrat¬ 
ing  runtime  optimization  methods  into  compilers  for  distributed  memory 
multiprocessors.  These  optimizations  are  essential  in  scientific  codes  that 
include  sparse  matrix  solvers,  or  in  programs  that  solve  partial  differential 
equations  using  adaptive  and  unstructured  meshes.  We  first  identified  a 
set  of  relevant  numerical  codes  that  required  runtime  optimizations.  After 
identifying  this  set,  we  performed  extensive  experimental  research  on  these 
codes.  The  results  of  our  experiments  not  only  identified  the  major  perfor¬ 
mance  bottlenecks  in  these  codes  but  also  helped  us  develop  a  rich  set  of 
optimizations  useful  and  essential  to  generating  reasonably  efficient  code  for 
this  class  of  problems  on  distributed  memory  machines.  Once  we  developed 
a  collection  of  run  time  optimizations,  we  built  a  compiler  that  identifies 
irregular  computations  and  performs  transformations  to  enhance  the  code. 

The  compiler  takes  as  input  a  simplified  Fortran  77  program  enhanced 
with  specifications  for  distributing  data,  it  outputs  a  message  passing  For¬ 
tran  program  for  the  Intel  iPSC/860  parallel  computer.  The  compiler  con¬ 
sists  of  two  distinct  layers.  The  bottom  layer  is  a  library  of  runtime  proce¬ 
dures  (Parti  -  Parallel  Automated  Runtime  Toolkit  at  ICASE)  designed  to 
efficiently  support  irregular  patterns  of  distributed  array  accesses.  The  top 
layer  is  a  compiler  that  carries  out  program  transformations  and  embeds 
the  Parti  procedures.  The  Parti  procedures  support  a  variety  of  operations 
that  include  off  processor  data  fetches  ,  off  processor  store  updates  on  reduc¬ 
tion  operations  performed  on  global  data  structures  and  storage  of  non-local 
data.  Parti  also  supports  non-uniform  distributed  array  partitions  in  which 
each  distributed  array  element  can  be  assigned  to  an  arbitrary  processor. 
A  multicomputer  program  is  generated  in  which  all  distributed  memory  ac¬ 
cesses  are  carried  out  using  embedded  procedures. 

It  must  be  emphasized  that  the  goal  of  this  project  is  not  to  develop  a 
production  quality  compiler,  but  to  demonstrate  that  run  time  optimizations 


can  be  automatically  and  efficiently  generated  by  a  compiler.  Most  of  the 
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described  in  section  2.  Section  3  describes  the  Parti  run  time  primitives 
that  have  been  implemented  and  incorporated  in  the  compiler.  In  section 
4  we  describe  the  ARF  language  and  the  overall  compiler  strategy  that 
demonstrates  the  interaction  between  the  two  layers  of  the  ARF  compiler. 
We  describe  the  compiler  in  the  context  of  two  code  examples.  These  exam¬ 
ples  are  written  in  ARF  and  translated  to  iPSC/860  code  by  our  compiler. 
In  Section  5  we  report  experimental  numbers  for  the  codes  compiled  by  the 
ARF  compiler.  In  Section  6  we  describe  the  relationship  between  our  work 
and  other  related  research  projects  in  the  area  and  we  conclude  in  Section  7. 

2  Distributed  Memory  Inspectors  and  Executors 

In  distributed  memory  machines,  large  data  arrays  need  to  be  partitioned 
between  local  memories  of  processors.  These  partitioned  data  arrays  are 
called  distributed  arrays.  We  follow  the  usual  practice  of  assigning  long 
term  storage  of  distributed  array  elements  to  specific  memory  locations  in 
the  machine.  Non-local  reads  require  that  a  processor  fetch  a  copy  of  that 
element  from  the  memory  of  the  processor  in  which  that  array  element  is 
stored.  Alternately,  a  processor  may  need  to  store  a  value  in  a  non-local  dis¬ 
tributed  array  element  requiring  the  processor  to  write  to  non-local  memory. 
An  issue  that  arises  at  this  point  is  where  does  a  processor  store  copies  of 
off-processor  data.  Due  to  the  irregular  nature  of  the  access  pattern,  it  is  not 
efficient  to  store  the  elements  in  temporary  arrays  or  overlap  areas  proposed 
by  Gerndt  [12].  Both  these  storage  schemes  result  in  large  wastage  of  mem¬ 
ory.  We  store  local  copies  of  off-processor  distributed  array  elements  in  hash 
tables  called  hashed  caches.  Hash  tables  result  in  less  wastage  of  memory 
and  quick  access  of  off-processor  data.  Run  time  primitives  are  implemented 
to  manage  the  hashed  caches.  These  primitives  initialize  the  hashed  caches, 
store  and  retrieve  data  from  them  and  flush  the  hashed  caches  when  appro¬ 
priate.  During  program  execution,  a  hash  table  records  off-processor  fetches 
and  stores.  We  are  consequently  able  to  recognize  when  more  than  one  ref¬ 
erence  is  being  made  to  the  same  off-processor  distributed  array  element,  so 
that  only  one  copy  of  that  element  need  be  fetched  or  stored. 

In  distributed  memory  MIMD  architectures,  there  is  typically  a  non¬ 
trivial  communications  latency  or  startup  cost  [8].  As  an  optimization  we 
block  messages  in  order  to  increase  the  message  size  and  reduce  the  number 
of  messages.  This  optimization  can  be  achieved  by  precomputing  what  data 
each  processor  needs  to  send  and  receive.  The  preprocessing  needed  to  per- 


2 


Each  processor  P: 


-  Preprocesses  its  own  loop  iterations 

-  Records  off-prccessor  fetches  and  stores  in  hashed  cache 

-  Finds  send/receive  calls  required  for  data  exchange 

1.  P  generates  list  of  all  off-processor  data  to  be  fetched 

2.  P  sends  messages  to  other  processors  requesting  copies  of 
required  data 

3.  Other  processors  tell  P  which  data  to  send 

4.  Send/Receive  pairs  generated  and  stored 


Figure  1:  Inspector  For  Parallel  Loop  on  Distributed  Memory  Multiproces¬ 
sor 

form  this  optimization  results  in  the  generation  of  an  inspector  loop.  Figure 
1  describes  the  form  of  the  inspector  loop  that  is  generated  assuming  the 
original  loop  is  parallel  and  thus  blocking  messages  is  legal.  The  distribution 
of  parallel  loop  indices  to  processors  determines  where  computations  are  to 
be  performed.  We  assume  that  all  distributed  arrays  referenced  have  been 
defined  and  initialized  and  that  loop  iterations  have  been  partitioned  among 
processors. 

During  the  inspector  phase,  we  carry  out  a  set  of  interprocessor  com¬ 
munications  that  allows  us  to  anticipate  exactly  which  send  and  receive 
communication  calls  each  processor  must  execute  prior  to  executing  the 
loop.  By  contrast,  individual  fetches  and  stores  carried  out  during  the  ac¬ 
tual  computation  would  result  in  expensive,  inefficient  and  awkward  code 
[23].  For  example,  in  such  a  case  processor  A  might  obtain  the  contents 
of  a  distributed  array  element  which  is  not  on  A  by  sending  a  message  to 
processor  B  associated  with  the  array  element.  Processor  B  would  have  to 
be  programmed  to  anticipate  a  request  of  this  type,  to  satisfy  the  request 
and  to  return  a  responding  message  containing  the  contents  of  the  specified 
array  element. 

The  inspector  loop  transformation  described  above  assumes  computing 
the  processor  on  which  the  non-local  data  resides  is  straight  forward.  For 
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example,  if  a  one  dimensional  array  is  distributed  in  a  block  manner,  simple 
functions  can  be  used  to  compute  the  processor  and  local  offset  of  a  particu¬ 
lar  array  element.  However,  there  are  many  situations  in  which  simple,  eas¬ 
ily  specified  distributed  array  partitions  are  inappropriate.  In  computations 
that  involve  an  unstructured  mesh,  we  attempt  to  partition  the  problem 
so  that  each  processor  performs  approximately  the  same  amount  of  work  to 
achieve  load  balancing  and  to  minimize  communication  overhead.  Typically, 
it  is  not  possible  to  express  the  resulting  array  partitions  in  a  simple  way.  By 
allowing  an  arbitrary  assignment  of  distributed  array  elements  to  processors, 
we  have  the  additional  burden  of  maintaining  a  data  structure  that  describes 
the  partitioning.  The  size  of  this  data  structure  must  be  the  same  as  the 
size  of  the  the  irregularly  distributed  array.  We  call  this  data  structure  a 
distributed  translation  table.  Distributed  translation  tables  are  partitioned 
between  processors  in  a  simple  straightforward  manner  described  in  Section 
3.4. 

In  order  to  access  an  array  element,  we  need  to  know  where  the  element 
is  stored  in  the  memory  of  the  distributed  machine.  This  information  is 
obtained  from  the  distributed  translation  table.  When  a  distributed  trans¬ 
lation  table  is  used  to  describe  array  mappings,  inspectors  must  be  modified 
so  that  they  access  the  distributed  table.  The  modifications  made  to  an 
inspector  are  outlined  in  Figure  2.  In  this  case,  the  distributed  translation 
table  is  used  to  determine  the  processor  on  which  an  element  resides. 

Once  the  preprocessing  is  completed,  every  processor  knows  exactly 
which  non-local  data  elements  it  needs  to  send  to  and  receive  from  the 
other  processors,  we  are  therefore  in  a  position  to  carry  out  the  necessary 
communication  and  computation.  The  loop  iB  transformed  into  an  executor 
loop.  Figure  3  outlines  the  steps  involved  and  they  apply  to  irregular  and 
regular  array  mappings.  The  initial  data  exchange  phase  follows  the  plan 
established  by  the  inspector.  When  a  processor  obtains  copies  of  non-local 
distributed  array  elements,  the  copies  are  written  into  the  processor’s  hashed 
cache.  Once  the  communication  phase  is  over,  each  processor  carries  out  its 
computation.  Each  processor  uses  locally  stored  portions  of  distributed  ar¬ 
rays  along  with  non-local  distributed  array  elements  stored  in  the  hashed 
cache.  When  the  computational  phase  is  finished,  distributed  array  elements 
to  be  stored  off-processor  are  obtained  from  the  hashed  cache  and  sent  to 
the  appropriate  off-processor  locations.  In  the  next  section  we  describe  the 
details  of  the  Parti  run  time  primitives  that  may  be  invoked  during  the 
inspector  and  executor  phases. 
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Each  processor  P: 


Figure  2: 
Mapping 


Preprocesses  its  own  loop  iterations 

Records  off-processor  fetches  and  stores  in  hashed  cache 

Consults  distributed  translation  table  to 

*  Find  location  in  distributed  memory  for  each  off-processor 
fetch  or  store 

Finds  send/receive  calls  required  for  data  exchange 

1.  P  generates  list  of  all  off-processor  data  to  be  fetched 

2.  P  sends  messages  to  other  processors  requesting  copies  of 
required  data 

3.  Other  processors  tell  P  which  data  to  send 

4.  Send/Receive  pairs  generated  and  stored 


Inspector  For  Parallel  Loop  Using  Irregular  Distributed  Array 
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•  Before  loop  or  code  segment 


1.  Data  to  be  sent  off- processor  read  from  distributed  arrays 

2.  Send/receive  calls  transport  off-processor  data 

3.  Data  written  into  hashed  cache 

•  Computation  carried  out 

-  off- processor  reads /writes  go  to  hashed  cache 

•  At  end  of  loop  or  code  segment 

1.  Data  to  be  stored  off-processor  is  read  from  hashed  cache 

2.  Send/receive  calls  transport  off-processor  data 

3.  Data  written  back  into  distributed  arrays  for  longer  term  storage 


Figure  3:  Executor  For  Parallel  Loop  on  Distributed  Memory  Multiprocessor 


3  Parti  primitives 

The  Parti  run  time  primitives  can  be  divided  into  three  categories;  primitives 
that  may  be  invoked  during  the  inspector  phase,  executor  phase  or  both 
inspector  and  executor  phase.  The  scheduler  primitive  invoked  during  the 
inspector  phase,  determines  the  send  and  receive  calls  that  are  needed  during 
the  executor  phase.  These  calls  may  be  to  either  scatter,  gather  or  perform 
reduction  operations  during  the  executor  phase.  The  distributed  translation 
table  mentioned  earlier  is  used  during  the  inspector  phase.  The  hashed 
cache  primitives  are  used  during  the  inspector  and  executor  phases.  The 
next  section  describes  the  details  of  the  scheduler,  distributed  translation 
table,  scatter,  gather,  reduction  and  hashed  cached  primitives. 

3.1  The  Scheduler  Primitive 


Processors 


Global  array  a 


local  array  a  offsets 


Figure  4:  Mapping  of  a  Global  Array  to  Processors 


We  will  use  a  simple  example  to  illustrate  the  preprocessing  carried  out  by 
the  scheduler.  Assume  we  have  a  distributed  array  a  that  is  partitioned 
among  three  processors  in  an  irregular  fashion  as  depicted  in  Figure  4  and 
there  is  a  loop  computation  such  that  the  access  pattern  of  array  a  is  as 
shown  in  Figure  5.  Each  processor  stores  its  elements  of  distributed  array 
a  in  a  local  array  a  .  Thus  processor  P\  needs  to  fetch  array  element  a(3) 
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Figure  5:  Irregular  Access  Pattern 


or  element  a  (2)  of  the  local  array  from  processor  P2  and  processors  P2  and 
P3  needs  to  fetch  a(4)  or  element  a  (2)  of  the  local  array  from  P\.  Recall 
that  the  task  of  the  scheduler  is  to  anticipate  exactly  which  send  and  receive 
communications  must  be  carried  out  by  each  processor.  The  scheduler  first 
figures  out  how  many  messages  each  processor  will  have  to  send  and  receive 
during  the  data  exchange  that  takes  place  in  the  executor  phase.  Defined 
on  each  processor  P1  is  an  array  nmsgs1.  Each  processor  sets  its  value  of 
nmsgsx(j)  to  1  if  it  needs  data  from  processor  j  or  to  0  if  it  does  not. 
The  scheduler  then  updates  nmsgs  on  each  processor  with  the  element-by¬ 
element  sum  nmsgsl(j) «-  nmsgsk(j).  This  operation  utilizes  a  function 
that  imposes  a  fan-in  tree  to  find  the  sums.  At  the  end  of  the  fan-in,  on  all 
processors,  the  entries  of  nmsgs  are  identical.  The  value  nmsgs(j)  is  equal 
to  the  number  of  messages  that  processor  pj  must  send  during  the  exchange 
phase.  In  our  example  scenario,  we  see  that  at  the  end  of  the  fan  in,  the 
value  of  nmsgs  on  each  processor  is  [2,1,0]  (Figure  6).  Thus  Pi  is  able 
to  determine  that  it  needs  to  send  data  to  two  other  (as  yet  unspecified) 
processors,  P2  needs  to  Bend  data  to  one  processor  and  P3  does  not  need  to 
send  any  data. 

At  this  point,  each  processor  transmits  to  the  appropriate  processor,  a 
list  of  required  array  elements.  This  list  contains  the  local  offsets  of  the 
global  array  elements.  In  our  example,  P\  sends  a  message  to  P2  requesting 
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element  2  of  the  local  array  a';  Pj  and  P3  send  a  message  to  Pi  requesting 
element  2  of  the  local  array  a .  Each  processor  now  has  the  information 
required  to  set  up  the  send  and  receive  messages  that  are  needed  to  carry 
out  the  scheduled  communications  (Figure  7). 


P(l)  needs 

data  from 
P(2) 

P(2)  needs 

data  from 

PCD 

P(3)  needs 

data  from 
p(i) 

Input  to  sum 

tree 

[0  1  0] 

[1  0  0] 

[1  0  0] 

Output  from 

[2  1  0] 

1 - 1 

O 

tH 

[2  1  0] 

sum  tree 
distributed  to 
all  processors 


Figure  6:  Computing  the  number  of  Send  Messages 


3.2  Data  Exchange  Primitives 
Data  exchangers  can  be  called  by  each  processor  to: 
gather  data  from  other  processors, 
scatter  data  to  other  processors,  or 
perform  global  reduction  operations 

These  exchangers  use  state  information  stored  by  the  scheduler.  As  de¬ 
scribed  in  the  previous  section  the  scheduler  determines  the  send  and  receive 
calls  needed  to  carry  out  data  exchanges.  The  scheduler  is  not  given  any 


Send  Processors 


Messages  sent  by  the  processors 


Receiving  Processors 


Data  sent  by  the  processors:  local  array  a 


Figure  7:  Final  Message  Pattern 


information  about  memory  locations  -  it  involves  only  processors  and  local 
indices. 

When  a  processor  P  calls  a  data  exchanger,  it  passes  to  the  exchanger 
routine  the  starting  address  of  the  first  local  array  element  in  its  memory. 
We  call  this  address  Ap.  The  exchanger  routines  use  Ap  to  read  or  write 
distributed  array  elements.  The  schedule  generated  by  the  scheduler  can 
be  reused.  A  schedule  can  also  be  used  to  carry  out  identical  patterns  of 
data  exchange  on  several  different  identically  distributed  arrays  or  on  several 
different  identically  distributed  array  sections.  The  same  schedule  can  be 
reused  to  repeatedly  carry  out  a  particular  pattern  of  data  exchange  on  a 
single  distributed  array,  and  any  of  the  data  exchange  primitives  can  make 
use  of  a  given  schedule. 

3.3  Calling  Sequence  of  Scheduler  and  Data  Exchanger 

In  this  section,  we  give  a  specific  example  of  the  calling  sequence  used  to 
invoke  the  schedule  and  data  exchange  primitives.  We  consider  the  following 
two  Parti  procedure  calls: 
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call  scheduler(id,n,hashed-cache,local-indices, processors) 

call  gather-exchanger(id,hashed-cache,local-array). 

In  this  example,  processor  arranges  to  obtain  copies  of  specified  off-processor 
data  elements,  these  copies  are  placed  in  the  hash  table  hashed-cache. 

Each  processor  passes  to  scheduler  a  list  of  off- processor  local  array  in¬ 
dices.  The  scheduler  will  build  a  schedule  that  will  make  it  possible  for  P 
to  obtain  n  data  elements.  P  will  obtain  data  element  *,  1  <  t  <  n  from 
processor  processor s(t),  local  index  local  -  indices(i).  A  previously  allo¬ 
cated  hash  table  hashed  —  cache  is  used  to  eliminate  duplicate  off-processor 
indices,  scheduler  returns  an  integer  id  which  is  be  used  by  a  subsequent 
call  to  gather-exchanger. 

Each  processor  the  calls  gather-exchanger,  gather-exchanger  passes  the 
address  of  the  memory  location  local  —  array  in  which  each  processor  stores 
its  portion  of  a  distributed  array,  gather-exchanger  returns  copies  of  the 
requested  off-processor  array  elements,  these  copies  are  placed  in  the  hash 
table  hashed-cache. 

3.4  The  Translation  Table 

We  are  able  to  allow  a  user  to  assign  globally  numbered  distributed  array 
elements  to  processors  in  an  irregular  pattern,  using  a  distributed  translation 
table.  Recall  that  the  scheduler  and  the  data  exchangers  deal  with  indices  of 
arrays  that  are  local  to  each  processor.  The  translation  primitives,  however, 
assume  that  distributed  array  elements  have  been  assigned  global  indices. 

The  procedure  build-translation-table  constructs  the  distributed  transla¬ 
tion  table.  Each  processor  passes  build-translation-table  a  set  of  globally 
numbered  indices  for  which  it  will  be  responsible.  The  distributed  transla¬ 
tion  table  may  be  striped  or  blocked  across  the  processors.  With  a  striped 
translation  table,  the  translation  table  entry  for  global  index  i  is  stored  on 
processor  i  mod  numprocs  where  numprocs  is  the  number  of  processors.  In 
a  blocked  translation  table,  translation  table  entries  are  partitioned  into  a 
number  of  equal  Bized  ranges  of  contiguous  integers,  these  ranges  are  placed 
in  consecutively  numbered  processors. 

Dereference  accesses  the  distributed  translation  table  constructed  in 
build-translation-table.  For  a  given  distributed  array,  dereference  is  passed  a 
set  of  global  indices  that  need  to  be  located  in  distributed  memory.  Derefer¬ 
ence  returns  the  processors  and  memory  locations  where  the  specified  global 
indices  are  stored. 
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We  will  illustrate  tlie  use  of  these  primitives  using  the  same  mapping  as 
in  Figure  4  except  that  the  number  of  processors  equals  2.  Two  processors 
call  build-translation-table.  Thus  P\  claims  responsibility  for  indices  1  and 
4,  while  Pj  claims  responsibility  for  indices  2  and  3.  We  assume  that  the 
translation  table  is  partitioned  between  2  processors  by  blocks.  We  depict 
the  translation  table  data  structure  in  Table  1.  Each  entry  of  the  translation 
table  assigns  a  processor  and  a  local  array  index  to  each  globally  indexed 
distributed  array  element.  In  our  example,  translation  table  information 
about  global  indices  1  and  2  is  stored  in  processor  1,  while  information 
about  global  indices  3  and  4  is  stored  in  processor  2. 

To  continue  our  example,  assume  that  both  processors  use  the  derefer¬ 
ence  primitive  to  find  assigned  processors  and  local  indices  corresponding  to 
particular  global  distributed  array  indices.  In  Table  2  we  depict  the  results 
obtained  when  processor  1  dereferences  global  indices  1  and  3,  and  processor 
2  dereferences  global  indices  2,  3  and  4. 
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3.5  The  Hashed  Cache 

The  usefulness  of  the  Parti  primitives  described  in  Section  3  can  be  enhanced 
by  coupling  these  primitives  with  hash  tables.  The  hash  table  records  the 
numerical  value  associated  with  each  distributed  array  element.  The  hash 
table  also  records  the  processor  and  local  index  associated  with  the  element. 

Dereference  uses  the  hash  table  to  reduce  the  volume  of  interprocessor 
communication.  Recall  that  dereference  returns  the  processor  assignments 
and  the  memory  locations  that  correspond  to  a  given  list  of  distributed 
array  indices.  Each  distributed  array  index  may  appear  several  times  in  lists 
passed  to  dereference.  The  hash  table  is  used  to  remove  these  duplicates. 

The  scheduler  and  the  data  exchange  procedures  use  hash  tables  to  store 
copies  of  off-processor  distributed  array  elements.  Lists  of  off-processor  dis¬ 
tributed  array  elements  passed  to  the  scheduler  may  have  duplicates,  the 
scheduler  uses  the  hash  table  to  remove  these.  The  gather  data  exchanger 
(or  gather-exchanger )  fetches  copies  of  off-processor  distributed  array  ele¬ 
ments  and  then  places  the  off-processor  distributed  array  values  in  a  hash 
table.  Similarly,  scatter-exchanger  obtains  copies  of  off-processor  distributed 
array  elements  from  a  hash  table  and  writes  the  values  obtained  into  a  spec¬ 
ified  local  array  element  on  a  designated  processor.  Primitives  to  support 
accumulations  to  non-local  memory  use  hash  tables  in  the  same  way  scatter 
does. 

Parti  supplies  a  number  of  other  primitives  that  support  reading  from, 
as  well  as  writing  and  accumulating  to,  hash  tables.  When  off-processor 
accumulations  must  be  performed,  we  first  carry  out  all  possible  accumula¬ 
tions  to  copies  of  distributed  array  elements  in  hash  table,  then  we  perform 
an  accumulation  data  exchange. 

Currently,  we  use  a  hash  function  that,  for  a  hash  cache  of  size  2*, 
masks  the  lower  k  bits  of  the  key.  The  key  is  formed  by  concatenating  the 
processor-local  index  pair  that  corresponds  to  a  distributed  array  reference. 

4  The  ARF  Language 

We  have  described  in  earlier  sections  the  2  distinct  layers  of  the  compiler. 
We  will  now  briefly  describe  the  extensions  that  we  have  added  to  Fortran 
77  to  create  the  ARF  (ARguably  Fortran)  language.  ARF  is  an  interface 
between  the  application  programs  and  the  Parti  run-time  support  primitives. 
The  ARF  compiler  generates  inspector  and  executor  loops  with  embedded 
primitives. 
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Distributed  arrays  are  declared  in  ARF  source.  These  distributed  arrays 
can  either  be  partitioned  between  processors  in  a  regular  manner  (e.g.  equal 
sized  blocks  of  contiguous  array  elements  assigned  to  each  processor),  or 
in  an  irregular  manner.  An  ARF  user  declares  a  mapping  into  distributed 
memory  for  each  distributed  array.  When  an  array  is  to  be  partitioned  in  an 
irregular  fashion,  mapping  information  is  specified  in  a  regularly  distributed 
integer  array.  Element  i  of  the  integer  array  describes  the  processor  to  which 
element  i  of  the  distributed  array  is  to  be  mapped.  Examples  are  shown 
below, 


51  distributed  regular  using  block  real  k(SIZE) 

52  distributed  regular  using  block  integer  map (SIZE) 

53  distributed  irregular  using  map  real  y(SIZE) . 

SI  declares  that  k  is  a  real  array,  distributed  in  a  regular  block  manner, 
S2  declares  that  map  is  an  integer  array,  also  distributed  in  a  regular  block 
manner.  S3  declares  a  real  array  y  whose  distribution  is  to  be  determined 
by  the  distributed  integer  array  map.  In  the  examples  we  give  in  this  paper, 
all  integer  arrays  used  to  specify  irregular  mappings  were  produced  by  hand 
coded  partitioning  procedures  and  then  passed  to  an  ARF  routine. 

Another  addition  to  Fortran  77  is  the  on  clause.  The  on  clause  has 
been  originally  implemented  in  Kali  [14].  It  is  a  mechanism  by  which  the 
user  has  control  over  distributing  the  iteration  space  or  work  load  among 
processors.  Distribute  do  is  an  ARF  language  extension,  this  implies  that  the 
loop  iterations  in  a  given  do  loop  should  be  distributed  between  processors. 
In  the  next  section  we  use  two  examples  to  illustrate  the  transformations 
and  optimizations  performed  by  the  ARF  compiler.  These  message  passing 
Fortran  codes  were  generated  by  the  ARF  compiler. 

4.1  Code  Generation  by  the  ARF  Compiler 

The  ARF  compiler  transforms  an  ARF  program  into  a  target  program  which 
incorporates  the  primitives  needed  to  efficiently  carry  out  the  distributed 
computation.  The  kernels  we  present  here  have  been  coded  in  ARF,  com¬ 
piled  and  run  on  an  iPSC/860;  in  Section  5  we  will  present  performance 
data  obtained  from  both  kernels. 
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4.1.1  Sparse  Block  Matrix  Vector  Multiply 

In  Figure  8  we  present  an  ARF  program  that  carries  out  a  block  sparse 
matrix  vector  multiply.  This  kernel  was  obtained  from  an  iterative  solver 
produced  for  a  program  designed  to  calculate  fluid  flow  for  geometries  de¬ 
fined  by  an  unstructured  mesh  [26].  The  matrix  is  assumed  to  have  size  4 
by  4  blocks  of  non-zero  entries.  Statements  S4  and  S5  are  loops  that  sweep 
over  the  non-zero  entries  in  each  block. 

Integer  array  partition  is  local  to  each  processor  and  enumerates  a  list 
of  indices  assigned  to  the  processor.  As  mentioned  earlier,  the  current  im¬ 
plementation  partitions  only  one  dimension,  the  last  dimension  of  the  array. 
The  Parti  primitives,  however,  do  support  a  broader  class  of  array  map¬ 
pings  [7].  Thus  partition  describes  the  partitioning  of  the  last  dimension 
of  the  arrays  declared  in  statements  SI  and  S2.  The  ARF  compiler  uses 
the  information  in  partition  to  make  calls  to  primitives  that  initialize  the 
distributed  translation  tables.  These  distributed  translation  tables  are  used 
to  describe  the  mapping  of  x,  y  ,  cola,  ncols  and  f  (statements  Si  and 
S2). 

The  partitioning  of  computational  work  is  specified  in  statement  S3  by 
an  on  clause.  In  this  example,  distributed  array  partition  is  used  to  specify 
which  loop  iterations  are  to  be  carried  out  on  each  processor.  The  reference 
x(m,cols(j  ,i))  in  S6  may  require  off-processor  references.  ARF  must  con¬ 
sequently  generate  an  inspector  to  produce  a  schedule  and  a  hash  table  to 
handle  accesses  to  the  distributed  array  x.  A  reference  to  the  irregularly 
distributed  array  f  occurs  in  statement  S6.  Note  that  distributed  array 
f  is  irregularly  distributed  using  array  partition  and  that  partition  is 
also  used  by  the  on  clause  to  partition  loop  iterations  in  S3.  It  can  there¬ 
fore  be  deduced  that  the  reference  to  f  in  statement  S6  is  on-processor, 
partition  specifies  how  distributed  array  elements  and  loop  iterations  are 
to  be  distributed  between  processors.  A  separate  partitioning  routine  gen¬ 
erates  partition.  In  this  paper,  we  simply  assume  that  array  partition 
is  passed  to  the  sparse  matrix  vector  multiply  kernel  after  having  been  gen¬ 
erated  elsewhere. 

The  ARF  compiler  generates  an  inspector  and  an  executor  to  run  on 
each  processor.  The  work  of  the  inspector  is  carried  out  on  each  processor 
as  follows: 

Call  build-translation-table  using  the  mapping  defined  by  array  partition. 

Generate  distributed  translation  table  Tpar tition- 
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Call  dereference  to  employ  translation  table  Tpartition  to  find  proces¬ 
sor  assignments,  PA  and  local  indices,  LA  for  consecutive  references  to 
x(m,  cols(j,  i)). 

Pass  PA  and  LA  to  scheduler  generate  schedule  S. 

Use  PA  and  LA  to  setup  hash  table  H. 

The  executor  generated  by  ARF  on  processor  P  is  depicted  in  Figure 
9.  In  Figure  9  we  use  Fortran  90  notation  where  appropriate  to  enhance 
readability.  OfF-prococessor  elements  of  x  are  gathered  and  placed  in  a 
hash  table  H  (step  I  Figure  9).  Values  from  x  are  obtained  from  H  or  from 
local  memory  as  is  appropriate  (step  Ila,  Figure  9).  Arrays  PA  and  LA  are 
used  to  distinguish  local  from  off-processor  array  accesses.  In  step  lib,  we 
accumulate  to  y.  Note  that  the  declarations  in  Si  and  S3  in  Figure  8  allow 
the  compiler  to  determine  that  accumulations  to  y  are  local. 

4.1.2  The  Fluxroe  Kernel 

In  section  5  we  will  present  the  ARF  compiler  output  of  a  more  complex 
kernel.  This  kernel  is  taken  from  a  program  that  computes  convective  fluxes 
using  a  method  based  on  Roe’s  approximate  Riemann  solver  [27],  [28];  we 
will  call  this  kernel  fluxroe.  Fluxroe  computes  the  flux  across  each  edge  of 
an  unstructured  mesh.  Fluxroe  accesses  elements  of  array  yold,  carries  out 
flux  calculations  and  accumulates  results  to  array  y.  As  was  the  case  in  the 
sparse  block  matrix  vector  multiply  kernel,  four  sections  of  each  array  are 
distributed  and  accessed  in  an  identical  manner.  In  Figure  10  we  depict  an 
outline  of  the  fluxroe  kernel.  We  denote  the  indices  of  the  two  vertices  that 
comprise  edge  t  by  nl  =  n(»,l)  and  n2  =  n(t,2).  To  compute  the  fluxes 
f(i,k)  across  the  ith  edge,  we  need  to  access  yold(nl,  k)  and  yold(n2,k), 
for  1  <  k  <  4  (part  I  Figure  10).  Once  the  fluxes  have  been  computed, 
we  add  the  newly  computed  flux  values  f(i,k)  to  y(nl,k)  and  subtract 
f(i,k)  from  y(n2,k)  (part  III  Figure  10).  Note  that  arrays  y  and  yold 
are  irregularly  distributed  using  y-partition,  and  that  distributed  array 
node  is  irregularly  distributed  using  edge-partition.  Since  the  on  clause 
in  the  distributed  do  statement  also  uses  edge-partition  to  specify  how 
loop  iterations  are  to  be  partitioned,  no  off-processor  references  are  made  to 
node  in  part  I  Figure  10. 

In  the  inspector,  we  need  to  compute  a  schedule  S„ i  for  the  off-processor 
additions  to  y(nl,k)  (part  Ilia  Figure  10),  and  a  different  schedule  5ns  for 
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51  distributed  irregular  using  partition  real*8  x(4,n),  y(4,n),f(4,4,majccol8,n) 

52  distributed  irregular  using  partition  integer  cols(9,n),  ncols(n) 


...  initialization  of  local  variables  ... 

S3  distributed  do  i=l,n  on  partition 

do  j=l,ncols(i) 

S4  do  k=l,4 

sum  =  0 

55  do  m  =  1,4 

56  sum  =  sum  +  f(m,kj,i)*x(m,col8(j,i)) 
enddo 

y(k4)  =  y(k^)  +  sum 

enddo 

enddo 

enddo 


Figure  8:  ARF  Sparse  Block  Matrix  Vector  Multiply 
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I.  call  gather-exchanger  using  schedule  S  to  obtain  off-processor  elements 
of  z 

gather-exchanger  places  gathered  data  in  hash  table  H 
count  =  1 

II.  for  all  rows  i  assigned  to  processor  P 
do  j=l,ncols(i) 

do  k=  1,4 
sum  =  0 

Ha.  if  PA(count)  ==  P  then 

vx(l:4)  =  x(l:4,LA(count)) 
else 

Use  PA(count),  LA(count)  to  get  vx(l:4)  from  hash  table 
H 

endif 
do  m=l,4 

sum  =  sum  +  f(mJk,j,i)*vx(m) 
end  do 

lib.  y(k4)  =  y(k4)  +  sum 
end  do 

count  =  count  +  1 
end  do 


Figure  9:  Executor  generated  from  ARF  for  Sparse  Block  Matrix  Vector 
Multiply 
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the  off-processor  subtractions  from  y(n2,  k)  (  part  Illb  Figure  10).  When 
parallelized,  fluxroe  reads  as  well  as  accumulates  to  off-processor  distributed 
array  locations.  As  we  pointed  out  in  Section  3.2,  any  of  the  data  exchange 
primitives  can  use  the  same  schedule.  We  can  use  schedule  Sn  1  to  gather 
off-processor  references  from  yold(k,nl)  (part  la  Figure  10),  and  we  can 
use  schedule  Sni  to  gather  off-processor  references  from  yold(k,n2)  (part  lb 
Figure  10). 

The  work  of  the  inspector  is  carried  out  as  follows: 

Call  build-translation-table  using  mapping  defined  by  array  y-partition. 

Generate  distributed  translation  table  Tv-partition • 

Call  dereference  to  employ  translation  table  Ty_partjtjon  to  find: 

1.  Processor  assignments  PAni  and  local  indices  LA&i  for  consecutive 
add  accumulations  to  y(k,n  1)  (the  same  PAni  and  LAni  can  be 
used  for  consecutive  references  to  y(fe,nl)). 

2.  Processor  assignments  PAnj  and  local  indices  LA^  for  consecutive 
substract  accumulations  to  y(k,  n2)  (the  same  PAb2  and  LA^  can 
be  used  for  consecutive  references  to  y(ktn 2)). 

Pass  PAni  and  LA„i  to  scheduler  to  obtain  schedule  Sni;  pass  PAn2  and 

LAn2  to  scheduler  to  obtain  schedule  Snj. 

Setup  hash  tables  Hni  and  Hn2. 

Figure  11  outlines  the  executor  produced  by  ARF  on  processor  P.  In 
Figure  11  we  use  Fortran  90  notation  where  appropriate  to  enhance  read¬ 
ability.  In  step  la  and  lb  we  gather  two  sets  of  off-processor  elements  of 
yold  using  schedules  5nl  and  5^3-  In  step  II  we  access  the  appropriate  ele¬ 
ments  of  yold  either  from  local  memory  or  from  the  appropriate  hash  table, 
and  in  step  in  we  use  yold  values  to  calculate  fluxes.  If  the  newly  com¬ 
puted  fluxes  are  to  be  accumulated  to  a  local  element  of  distributed  array 
y,  the  appropriate  addition  or  subtraction  is  carried  out  at  once  (  steps  IVa 
and  IV c  Figure  11).  When  a  flux  must  be  accumulated  to  a  off-processor 
element  of  y,  we  accumulate  the  flux  to  a  copy  of  y  stored  in  a  hash  table 
(steps  IVb  and  IVd  Figure  11).  When  all  fluxes  have  been  calculated  and  all 
local  accumulations  are  completed,  we  then  call  the  scatter-add  and  scatter- 
subtract  exchangers.  These  exchangers  carry  out  the  needed  off-processor 
accumulations. 
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distributed  irregular  using  y-p&rtition  real*8  yold(4, Number-nodes), 
y(4,Number-nodes) 

distributed  irregular  using  edge-partition  integer  node(2, Number-edges) 

...  initialization  of  local  variables  ... 

distributed  do  i  =  1, Number-edges  on  edge-partition 

I.  nl  =  node(14) 
n2  =  node(24) 

do  k=l,4 

la.  Va(k)  =  yold(k,nl) 

lb.  Vb(k)  =  yold(k,n2) 
end  do 

II.  Calculate  flux  using  Va(k),  Vb(k) 

HI.  do  k=l,4 

Ilia.  y(k,nl)  =  y(k,nl)  +  flux(k) 

Illb.  y(k,n2)  =  y(k,n2)  -  flux(k) 
end  do 

end  do 

Figure  10:  ARF  Kernel  From  Riemann  Solver 
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The  current  version  of  the  ARF  compiler  attempts  to  minimize  the  num¬ 
ber  of  schedules  to  be  computed.  We  might  have  produced  a  single  schedule 
for  all  off-processor  yold  data  accesses.  If  the  inspector  produced  a  single 
schedule  for  all  accesses  to  yold,  it  would  have  been  necessary  to  compute 
three  different  schedules  in  the  inspector.  Computing  a  single  schedule  for 
all  references  to  yold  might  have  led  to  a  more  efficient  executor  at  the  cost 
of  a  more  expensive  inspector. 

4.2  Memory  Utilization 

We  will  give  an  overview  of  some  of  the  memory  requirements  exacted  by 
the  methods  described  in  this  section,  and  suggest  some  ways  in  which  these 
requirements  can  be  reduced.  Many  sparse  and  unstructured  programs  use 
large  integer  arrays  to  determine  reference  patterns.  In  this  respect,  the 
kernels  depicted  here  are  typical.  In  Figure  8,  a  9 n  element  integer  array 
cols  is  used  for  this  purpose;  while  in  Figure  10,  a  size  2  *  Number  -  edges 
array  node  is  employed.  The  executors  depicted  in  Figure  9  and  Figure  11 
replace  cols  and  node  with  local  arrays  that  store  the  processor  assignments 
and  the  local  indices  for  references  to  irregularly  distributed  arrays.  In  the 
kernels  in  Figure  8,  the  sum  of  the  number  of  elements  used  in  all  processors 
to  store  both  processor  assignments  and  local  indices  is  no  larger  than  18n; 
in  Figure  10  the  parallelized  code  uses  a  total  of  4  *  Number  -edges  elements. 

The  amount  of  additional  storage  needed  for  the  parallelized  code  can 
be  reduced  in  the  following  simple  manner.  The  iterations  I  of  a  loop  are  di¬ 
vided  into  into  two  disjoint  sets.  The  first  set  of  iterations  is  Iiocali  where 
all  memory  references  are  to  locally  stored  array  elements.  The  second 
set  is  Ioff-jn'ocMsar>ui  this  set,  each  iteration  contains  some  off- processor 
distributed  array  reference.  In  this  case  we  need  only  to  list  processor  as¬ 
signments  for  loop  iterations  I0ff -processor-  Since  it  is  frequently  possible 
to  map  problems  so  that  most  memory  references  are  local  to  a  processor, 
a  substantial  memory  savings  will  result. 

The  schemes  described  thus  far  would  use  very  large  quantities  of  extra 
memory  when  attempting  to  handle  a  loop  in  which  a  small  number  of 
distributed  array  elements  are  accessed  many  times.  For  instance,  consider 
the  following  loop  where  f  is  a  function  defined  so  that  1  <  f(i)  <  2  for  any 
i. 
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la.  call  gather-exchanger  using  schedule  Sni  to  obtain  first  set  of  off- 
processor  elements  of  yold 

gather-exchanger  places  data  in  hash  table  Hai. 

lb.  call  gather-exchanger  using  schedule  5„2,  to  obtain  second  set  of  off- 
processor  elements  of  yold 

gather-exchanger  places  data  in  hash  table  H^. 

count  =  1 

II.  for  edges  i  assigned  to  processor  P 

if  (Pi4„i(count)  .EQ.  P )  then 

va(l:4)  =  yold(l:4,£j4ni(count))  else 
get  va(l:4)  from  hash  table  Hnj 

endif 

if  (Pi4n2(count)  .EQ.  P)  then 

vb(l:4)  =  yold(l:4,Zi4na(count))  else 
get  vb(l:4)  from  hash  table 

endif 

III.  Calculate  fluxes  flux(l:4)  using  va(l:4)  and  vb(l:4) 

IV.  if  PAni  (count)  .EQ.  P  then 

IVa.  yold(  1  -A,LAni ( count ) )  =  yold(  1:4, ZAni (count))  +  flux(l:4) 
else 

IVb.  Accumulate  flux(l:4)  to  hash  table  Hni 
endif 

if  PAtai count)  .EQ.  P  then 

IVc.  yold(  1 :4 ,X ( count ) )  =  yold(l:4,iAn2(count))  -  flux(l:4) 
else 

IVd.  Accumulate  flux(l:4)  to  hash  table  Has 
endif 

end  do 

count  =  count +1 

Va.  Call  scatter-add  exchanger  using  schedule  Sai  and  hash  table  Hai. 

Vb.  Call  scatter-subtract  exchanger  using  schedule  Sna  and  hash  table  Hb2- 


Figure  11:  Executor  generated  from  ARF  for  Fluxroe  Kernel 
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distributed  irregular  partition  y 


do  i=l,  HUGE  -  NUMBER 
....  y(f(i)) 
end  do 

In  the  above  loop,  the  reference  pattern  of  distributed  array  y  is  de¬ 
termined  by  f .  In  this  example,  at  most  two  distinct  elements  of  y  are 
referenced  in  the  loop.  Loops  of  this  sort  can  be  handled  by  using  a  hash  ta¬ 
ble  to  store  processor  and  local  index  assignments  for  each  distinct  memory 
reference.  In  our  example,  each  processor  would  have  to  store  processor  and 
local  index  assignments  for  no  more  than  two  references  to  distributed  array 
y.  There  is  a  performance  penalty  that  must  be  paid  for  using  a  hash  table 
to  find  processor  and  local  index  assignments  for  distributed  array  elements. 
After  examining  a  variety  of  sparse  and  unstructured  codes,  we  chose  not 
to  implement  the  method  described  in  this  section  in  the  ARF  compiler.  In 
[19],  we  present  an  analysis  of  the  type  of  time  and  space  tradeoffs  outlined 
in  this  section. 

5  Experimental  Results 

In  this  section  we  present  a  range  of  performance  data  that  summarizes  the 
effects  of  preprocessing  on  measures  of  overall  efficiency  and  that  gives  some 
insight  into  the  performance  effects  of  problem  irregularity  and  partitioning. 
Our  computational  experiments  employed  the  fluxroe  kernel  and  the  block 
sparse  matrix  vector  multiply  kernel.  Both  kernels  were  coded  in  ARF;  the 
parallelized  benchmark  numbers  we  present  were  obtained  from  programs 
generated  by  the  ARF  compiler.  It  should  be  noted  that  the  syntax  accepted 
by  our  ARF  compiler  differs  in  some  minor  ways  from  the  that  presented  in 
the  previous  sections. 

The  experiments  described  in  this  paper  used  either  a  32  processor 
iPSC/860  machine  located  at  ICASE,  NASA  Langley  Research  Center  or  a 
128  processor  iPSC/860  machine  located  at  Oak  Ridge  National  Laborato¬ 
ries.  Each  processor  had  8  megabytes  of  memory.  We  used  the  Greenhill 
1.8.5  Beta  version  C  compiler  to  generate  code  for  the  80860  processors. 
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5.1  Unstructured  Mesh  Data 


We  use  as  input  data  a  variety  of  unstructured  meshes;  both  actual  un¬ 
structured  meshes  obtained  from  aerodynamic  simulations  and  synthetically 
generated  meshes. 

Unstructured  Meshes  from  Aerodynamics  :  We  use  two  different 
unstructured  meshes  generated  from  aerodynamic  simulations. 

Mesh  A:  A  21,672  element  mesh  generated  to  carry  out  an  aero¬ 
dynamic  simulation  involving  a  multielement  airfoil  in  a  landing 
configuration  [17].  This  mesh  has  11143  points. 

Mesh  B:  A  37,741  element  mesh  generated  to  simulate  a  4.2  % 
circular  arc  airfoil  in  a  channel  [11].  This  mesh  has  19155  points. 

Each  mesh  point  is  associated  with  an  ( x,y )  coordinate  in  a  physical 
domain.  We  use  domain  information  to  partition  the  mesh  in  three 
different  ways;  strips,  orthogonal  binary  dissection  algorithm  ( [5],  [10]) 
and  another  mesh  partitioning  algorithm  jagged  partitioning,  described 
in  [24]. 

Synthetic  Mesh  from  Templates 

A  finite  difference  template  is  used  to  link  K  points  in  a  square  two  di¬ 
mensional  mesh.  This  connectivity  pattern  is  incrementally  distorted. 
Random  edges  are  introduced  subject  to  the  constraint  that  in  the  new 
mesh,  each  point  still  requires  information  from  K  other  mesh  points. 

This  mesh  generator  makes  the  following  assumptions: 

I.  The  problem  domain  consists  of  a  2-dimensional  square  mesh  of 
N  points, 

II.  Each  point  is  initially  connected  to  X  neighbors  determined  by  a 
finite  difference  template, 

III.  With  probability  q,  each  mesh  link  is  replaced  by  a  link  to  a 
randomly  chosen  mesh  point. 

Note  that  when  q  is  equal  to  0.0,  no  mesh  links  are  modified  and 
no  changes  are  introduced  by  step  III.  When  q  is  equal  to  1.0  we 
have  a  completely  random  graph.  In  this  paper  we  will  make  use  of 
two  templates.  One  template  connects  each  point  to  its  four  nearest 
neighbors  (K=4);  the  other  template  connects  each  point  to  both  its 
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four  nearest  neighbors  as  well  as  to  each  of  its  four  diagonal  neighbors 
(*■=8).  We  refer  to  the  K  =  4  template  as  a  five  point  template 
and  we  refer  to  the  K= 8  template  as  a  nine  point  template.  In  the 
experiments  to  be  described  in  the  rest  of  this  section,  we  employed  a 
256  by  256  point  mesh. 

5.2  Overall  Performance 

We  first  present  data  to  give  an  overview  of  the  performance  we  obtained 
on  the  iPSC/860  from  the  ARF  compiler  output.  In  the  results  depicted 
in  this  section,  we  use  a  blocked  distributed  translation  table.  In  Table  3 
we  present  a)  the  inspector  time:  time  required  to  carry  out  the  inspector 
preprocessing  phase,  b)  computation  time:  the  time  required  to  perform 
computations  in  the  iterative  portion  of  the  program  and  c)  the  commu¬ 
nication  time:  the  time  required  to  exchange  messages  within  the  iterative 
portion  of  the  program.  The  inspector  time  includes  the  time  required  to 
set  up  the  needed  distributed  translation  table  as  well  as  the  time  required 
to  access  the  distributed  translation  table  when  carrying  out  the  preprocess¬ 
ing  in  the  inspector.  Unstructured  Meshes  A  and  B  were  partitioned  using 
orthogonal  binary  dissection.  In  these  experiments,  the  ratio  of  the  time 
required  to  carry  out  the  inspector  to  the  computation  time  required  for  a 
single  iteration  ranged  from  a  factor  of  0.7  to  a  factor  of  3.6.  Most  of  the 
preprocessing  time  goes  to  setting  up  and  using  the  distributed  translation 
table.  For  instance,  consider  the  block  matrix  vector  multiply  on  64  proces¬ 
sors  using  the  21,672  element  mesh.  The  total  preprocessing  cost  was  122 
milliseconds,  of  which  111  milliseconds  went  to  work  related  to  the  trans¬ 
lation  table.  We  define  parallel  efficiency  for  a  given  number  of  processors 
P  as  the  sequential  time  divided  by  the  product  of  the  execution  time  on 
P  processors  times  P.  The  sequential  time  was  measured  using  a  separate 
sequential  version  of  the  each  kernel  run  on  a  single  node  of  the  iPSC/860. 
In  Table  3  we  depict  under  the  column  single  sweep  efficiency,  the  parallel 
efficiencies  we  would  obtain  were  we  required  to  preprocess  the  kernel  each 
time  we  carried  out  the  calculations.  In  reality,  preprocessing  time  can  be 
amortized  over  multiple  mesh  sweeps.  If  we  neglect  the  time  required  to  pre- 
process  the  problem  in  computing  parallel  efficiencies,  we  obtain  the  second 
set  of  parallel  efficiency  measurements;  the  executor  efficiency  presented  in 
Table  3.  The  executor  efficiencies  for  64  processors  ranged  from  0.48  to 
0.59,  while  the  single  sweep  efficiencies  ranged  from  0.10  to  0.17. 

In  the  experiments  depicted  in  Table  3,  the  time  spent  computing  is 
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at  least  a  factor  of  2  greater  than  the  communication  time.  The  executor 
efficiencies  are,  however,  impacted  by  the  fact  that  the  computations  in  the 
parallelized  codes  are  carried  out  less  efficiently  than  those  in  the  sequential 
program.  The  parallel  code  spends  time  accessing  the  hashed  cache.  It  also 
needs  to  perform  more  indirections  than  does  the  sequential  program. 


Table  3:  Performance  on  different  number  of  processors 


nprocs 

inspector 

time(ms) 

comp 

time(ms) 
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time(ms) 
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32 
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23 

S9 

64 
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69 

14 
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In  Table  4,  we  investigate  the  performance  of  the  fluxroe  kernel  for 
meshes  with  varying  degrees  of  regularity  and  for  varying  mesh  mappings. 
We  used  32  processors  in  this  experiment.  In  Table  4  we  depict  synthetic 
meshes  derived  from  5  and  9  point  stencils  with  probability  of  edge  move  q 
equal  to  either  0.0  or  0.4.  These  meshes  were  mapped  by  1-D  strips  or  by  2-D 
blocks.  As  one  might  expect,  for  the  synthetic  meshes  the  communications 
cost 8  increase  dramatically  for  increasing  q.  We  see  these  dramatic  increases 
because  both  the  volume  of  communication  required  and  the  number  of 
messages  sent  per  node  are  much  higher  for  large  q.  Preprocessing  costs 
also  increased  with  q  but  while  the  communications  costs  went  up  by  at 
least  a  factor  of  16,  preprocessing  costs  went  up  by  at  most  a  factor  of  1.8. 

We  also  depict  in  Table  4  results  from  Meshes  A  and  B.  We  partitioned 
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Table  4:  Performance  on  32  processors  with  different  meshes 


nprocs 
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the  mesh  in  three  different  ways;  strips,  the  orthogonal  binary  dissection 
algorithm  and  jagged  partitioning.  Both  binary  dissection  and  the  jagged 
partitioning  algorithm  break  the  domain  into  two  dimensional  rectangular 
regions,  and  the  two  methods  produce  very  similar  performance  results. 

5.3  Breakdown  of  Inspector  Overhead 

In  Table  5,  we  measure  the  cost  of  dereferencing  and  scheduling  the  fluxroe 
kernel  on  different  numbers  of  processors.  We  again  use  a  blocked  translation 
table.  We  use  a  five  point  template  and  we  partition  the  mesh  either  into 
1-D  strips  or  into  2-D  blocks.  When  the  mesh  is  partitioned  into  strips, 
dereference  involves  mostly  local  data  accesses  since  the  domain  data  and 
the  translation  table  are  identically  partitioned.  When  strip  partitioning 
is  used,  translation  table  initialization  does  not  involve  any  communication. 
The  measurements  presented  in  Table  5  are  defined  in  the  following  manner: 

Executor  time  is  the  computation  and  communication  time  required  to 
execute  the  kernel;  it  does  not  include  time  required  for  preprocessing. 

Table  initialization  time  is  the  time  needed  to  initialize  the  distributed 
translation  table, 

Dereference  time  is  the  time  taken  by  the  dereference  Parti  primitive, 
and 

Scheduler  time  is  the  time  required  to  produce  the  communications 
schedule  once  the  required  processor  locations  and  local  indices  have 
been  found  by  dereference. 

In  Table  5  we  note  that  the  majority  of  the  costs  incurred  by  the  in¬ 
spector  are  due  to  the  translation  table  initialization  and  dereference.  For 
instance  consider  the  case  where  64  processors  are  used  to  carry  out  a  sweep 
over  a  2-D  block  partitioned  mesh  with  a  5  point  template.  The  translation 
table  initialization  and  dereference  together  require  183  %  of  the  executor 
time  while  the  generation  of  the  schedule  requires  only  12  %  of  the  executor 
time. 

In  the  problems  depicted  in  Table  5,  communication  costs  comprise  a 
rather  small  fraction  of  the  executor  time,  consequently  the  method  used 
to  partition  the  domain  does  not  make  a  significant  performance  impact  on 
executor  time.  In  Table  5,  the  costs  of  translation  table  initialization  and  of 
dereference  are  both  strongly  dependent  on  how  the  domain  is  partitioned. 
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2-D  block  partitioning  leads  to  higher  translation  table  related  costs,  this  is 
almost  certainly  due  to  the  increased  communication  requirements  needed 
for  translation  table  initialization  and  dereference.  Strip  partitioning  per  se 
does  not  necessarily  lead  to  low  translation  table  related  costs.  In  Table  4 
we  note  that  strip  partitioning  actually  leads  to  higher  inspector  costs  for 
both  Mesh  A  and  Mesh  B.  The  translation  table  is  partitioned  so  that  blocks 
of  contiguously  numbered  indices  are  assigned  to  each  processor.  However  in 
Mesh  A  and  Mesh  B,  mesh  points  are  not  numbered  in  a  regular  fashion  so 
the  indices  corresponding  to  a  domain  strip  are  not  contiguously  numbered. 

Table  5:  Cost  of  dereferencing  and  scheduling  on  different  number  of  pro¬ 
cessors _ 


nprocs 

executor 
time  (ms) 

table  init 
time  (ms) 

dereference 
time  (ms) 

schedule 
time  (ms) 
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5  point  template  synthetic  mesh  partioned  into  2-D  blocks 
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42 

32 
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26 

64 

158 

77 

212 

19 

5.4  Cost  of  translation  table 

In  Section  3.4  we  described  two  straightforward  ways  to  map  a  distributed 
translation  table  onto  processors.  We  consider  the  question  of  how  to  dis¬ 
tribute  the  translation  table  so  as  to  minimize  costs  associated  with  trans¬ 
lation  table  access.  Table  6  compares  the  time  required  to  carry  out  deref¬ 
erence  on  blocked  and  striped  translation  tables  by  depicting: 

the  time  required  to  carry  out  a  particular  call  to  dereference , 

the  average  number  of  non-local  accesses  to  table  entries  required  by 
dereference ,  and 

the  average  number  of  non-local  processors  accessed  during  the  call  to 
dereference. 
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When  we  examine  the  results  for  unstructured  Meshes  A  and  B,  we 
note  no  consistent  performance  difference  in  the  cost  required  to  dereference 
a  blocked  or  a  striped  translation  table.  Similar  numbers  of  off-processor 
table  entries  need  to  be  accessed  for  either  translation  table  distribution. 
Blocked  translation  tables  do  lead  to  superior  performance  when  we  use 
the  synthetic  meshes.  For  the  reasons  described  in  Section  5.3,  we  obtain 
particularly  good  results  when  we  use  a  striped  partition  with  a  blocked 
translation  table.  It  is  of  interest  to  note  that  the  blocked  translation  table 
also  proved  to  be  superior  when  we  used  synthetic  meshes  partitioned  in 
2-D  blocks. 


Table  6:  Cost  of  dereference  on  32  processors 
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5.5  Scheduler  and  Data  Exchanger  Performance 

To  quantify  the  communications  costs  incurred  by  the  Parti  scheduler  and 
data  exchange  primitives,  we  measured  the  time  required  to  carry  out  the 
scheduler ,  gather-exchanger  and  scatter-exchanger  procedure  calls  and  com¬ 
pared  them  to  the  hand  coded  version  of  iPSC/860  supplied  sends  and  re- 
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ceives;  the  sends  and  receives  communicated  the  same  amount  of  data  as 
did  the  Parti  procedures.  We  performed  an  experiment  in  which  two  pro¬ 
cessors  repeatedly  exchanged  W  single  precision  words  of  information.  The 
exchange  was  carried  out  using  gather-exchangers,  scatter-exchangers  and 
the  iPSC/860  supplied  send  and  receive  calls.  In  Table  7  we  depict  the  re¬ 
sults  of  these  experiments.  We  present  the  time  (in  milliseconds)  required  to 
carry  out  the  requisite  data  exchange  using  send  and  receive  messages.  We 
then  present  the  ratio  between  the  time  taken  by  the  scheduler  and  gather- 
exchanger  Parti  primitive  calls  and  the  time  taken  by  the  equivalent  send 
and  receive  calls.  The  scatter  exchanger  calls  were  also  timed,  the  results  of 
which  were  virtually  identical  to  that  of  the  corresponding  gather-exchanger 
call. 

iFrom  Table  7  we  see  that  gather-exchanger  took  no  more  than  20% 
more  time  than  explicitly  coded  send/receive  pairs  to  move  W  words  of 
information  between  two  processors.  The  additional  overhead  required  for 
scheduler  to  carry  out  the  data  exchange  was  a  factor  of  2.1  to  1.0  times  the 
cost  of  using  explicitly  coded  send/receive  pairs  to  move  W  words. 

6  Relation  to  Other  Work 

Programs  designed  to  carry  out  a  range  of  irregular  computations  includ¬ 
ing  sparse  direct  and  iterative  methods  require  many  of  the  optimizations 
described  in  this  paper.  Some  examples  of  such  programs  are  described  in 
[2],  [16],  [4],  [29]  and  [10]. 

Several  researchers  have  developed  programming  environments  that  are 
targeted  towards  particular  classes  of  irregular  or  adaptive  problems.  Williams 
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[29]  describes  a  programming  environment  (DIME)  for  calculations  with  un¬ 
structured  triangular  meshes  using  distributed  memory  machines.  Baden  [3] 
has  developed  a  programming  environment  targeted  towards  particle  com¬ 
putations,  this  programming  environment  provides  facilities  that  support 
dynamic  load  balancing. 

There  are  a  variety  of  compiler  projects  targeted  at  distributed  memory 
multiprocessors  [30],  [9],  [21],  [20],  [1],  [25].  With  the  exception  of  Kali 
project  [15],  and  the  Parti  work  described  here  and  in  [22],  [18],  and  [23]; 
these  compilers  do  not  attempt  to  efficiently  deal  with  loops  that  arise  in 
sparse  or  unstructured  scientific  computations. 

We  have  produced  and  benchmarked  a  prototype  compiler  that  is  able  to 
generate  code  capable  of  efficiently  handling  kernels  from  sparse  and  unstruc¬ 
tured  computations.  The  procedures  that  carry  out  runtime  optimizations 
are  coupled  to  a  distributed  memory  compiler  via  a  set  of  compiler  trans¬ 
formations.  The  compiler  described  and  tested  in  this  paper  is  qualitatively 
different  from  the  efforts  cited  above  in  a  number  of  important  respects. 
We  have  developed  and  demonstrated  mechanisms  that  allow  us  to  support 
irregularly  distributed  arrays.  Irregularly  distributed  arrays  must  be  sup¬ 
ported  in  order  to  make  it  possible  to  map  data  and  computational  work  in 
an  arbitrary  manner.  Because  we  can  support  irregularly  distributed  arrays, 
it  was  possible  for  us  to  compare  the  performance  effects  of  different  problem 
mappings  (Section  5).  Support  for  arbitrary  distributions  was  proposed  in 
[18]  and  [23]  but  to  our  knowledge,  this  is  the  first  implementation  of  a  com¬ 
piler  based  distributed  translation  table  mechanism  for  irregular  scientific 
problems. 

We  find  that  many  unstructured  NASA  codes  must  carry  out  data  ac¬ 
cumulations  to  off-processor  memory  locations.  We  chose  one  of  our  kernels 
to  demonstrate  this,  and  designed  our  primitives  and  compiler  to  be  able 
to  handle  this  situation.  To  our  knowledge,  our  compiler  effort  is  unique 
in  its  ability  to  efficiently  carry  out  irregular  patterns  of  off-processor  data 
accumulations. 

We  augment  our  primitives  with  a  hash  table  designed  to  eliminate  du¬ 
plicate  data  accesses.  In  addition,  we  use  the  hash  table  to  manage  copies 
of  off-processor  array  elements.  Other  researchers  have  used  different  data 
structures  for  management  of  off-processor  data  copies  [15]. 
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7  Conclusion 


In  this  paper  we  describe  and  experimentally  characterize  a  compiler  and 
runtime  support  procedures  which  embody  methods  that  are  capable  of 
handling  a  wide  range  of  irregular  problems  in  scientific  computing.  After 
examining  a  number  of  complete  NASA  codes,  we  chose  to  demonstrate  our 
methods  using  two  kernels  extracted  from  those  codes.  Both  of  these  kernels 
involved  computations  over  unstructured  meshes.  We  coded  both  kernels  in 
ARF,  our  dialect  of  Fortran,  and  generated  code  to  ran  on  the  nodes  of  the 
iPSC/860.  Detailed  timings  were  carried  out  on  both  kernels  using  unstruc¬ 
tured  meshes  from  aerodynamics,  along  with  meshes  that  were  generated 
by  using  random  numbers  to  incrementally  distort  matrices  obtained  from 
a  fixed  finite  difference  template.  This  benchmarking  suite  stressed  the 
communications  capabilities  of  the  iPSC/860  and  the  Parti  primitives  in  a 
variety  of  ways. 

In  the  experiments  we  reported  in  Section  5.2,  we  saw  that  that  the  ratio 
of  the  time  required  to  carry  out  all  preprocessing  to  the  time  required  for 
a  single  iteration  of  either  kernel  ranged  from  a  factor  of  0.7  to  a  factor  of 
3.6.  We  then  saw  in  Section  5.3  that  the  majority  of  the  preprocessing  costs 
arose  from  the  need  to  support  irregularly  distributed  arrays.  In  Section  5.5 
the  performance  of  the  scheduler  and  data  exchanger  Parti  primitives  were 
quantified.  The  data-exchangers  turned  out  to  be  at  most  20%  more  time 
consuming  than  the  analogous  send  and  receive  calls  provided  by  Intel. 

We  believe  that  one  of  the  virtues  of  our  layered  approach  to  distributed 
compiler  design  is  that  we  have  managed  to  capture  a  set  of  critical  op¬ 
timizations  in  our  runtime  support  primitives.  Our  primitives,  and  hence 
our  optimizations,  can  be  migrated  to  a  variety  of  compilers  targeted  to¬ 
wards  distributed  memory  multiprocessors.  We  intend  to  implement  these 
primitives  in  the  ParaScope  parallel  programming  environment  [13].  In  ad¬ 
dition,  Parti  primitives  can  and  are  being  used  directly  by  programmers  in 
applications  codes  [7]. 

Most  of  the  complexity  of  our  system  is  in  the  Parti  procedures.  The 
Parti  procedures  have  been  developed  so  that  that  transformations  needed 
to  embed  the  appropriate  primitives  can  be  implemented  with  relative  ease 
in  distributed  memory  compilers.  The  primitives  used  to  implement  the 
runtime  support  include  communications  procedures  designed  to  support 
irregular  patterns  of  distributed  array  access,  and  procedures  to  find  the 
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location  of  irregularly  mapped  distributed  array  data  using  distributed  trans¬ 
lation  tables.  Primitives  also  support  the  maintenance  of  hash  tables  used 
to  store  copies  of  off-processor  data. 
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